Conversation

@jglick (Member) commented Dec 12, 2017

Saw a support bundle that showed a thread inside

"jenkins.util.Timer [#3] / waiting for JNLP4-connect connection from …" …
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at hudson.remoting.Request.call(Request.java:165)
    - locked <0x…> (a hudson.remoting.UserRequest)
    at hudson.remoting.Channel.call(Channel.java:903)
    at hudson.FilePath.act(FilePath.java:987)
    at hudson.FilePath.act(FilePath.java:976)
    at hudson.FilePath.deleteRecursive(FilePath.java:1178)
    at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.cleanup(FileMonitoringTask.java:187)
    at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:326)
    at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:279)
    at …

From that it was unclear whether this one call to FilePath.deleteRecursive was in fact taking more than the 10s allotted, or whether there were lots of repeated calls to run/check, or what. This patch:

  • Ensures that even if some code swallows InterruptedException, a new signal will be sent.
  • Makes sure something appears in the system log when a thread is blowing past its timeout.

I tried to also set the thread name to indicate how stalled it is, to allow this information to appear in thread dumps, but could not get it to work—whether due to a JVM bug or some oversight on my part, I am not sure.
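
To make the two bullets above concrete, here is a minimal sketch of the idea, assuming a plain JDK scheduled pool in place of jenkins.util.Timer so the example stays self-contained. It is an illustration, not the actual patch: the class and every name in it are made up for the example (the real utility being modified is the existing Timeout class mentioned later in this thread).

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.logging.Level;
    import java.util.logging.Logger;

    /** Illustrative sketch only: a timeout guard that logs overruns and re-sends interrupts. */
    class RepeatingTimeout implements AutoCloseable {

        private static final Logger LOGGER = Logger.getLogger(RepeatingTimeout.class.getName());
        // Stand-in for jenkins.util.Timer.get(), which is likewise a fixed-size scheduled pool.
        static final ScheduledExecutorService TIMER = Executors.newScheduledThreadPool(10);

        private final Thread thread;
        private final long deadlineNanos;
        private volatile boolean done;

        RepeatingTimeout(Thread thread, long timeout, TimeUnit unit) {
            this.thread = thread;
            this.deadlineNanos = System.nanoTime() + unit.toNanos(timeout);
            TIMER.schedule(this::ping, unit.toNanos(timeout), TimeUnit.NANOSECONDS);
        }

        private void ping() {
            if (done) {
                return; // the guarded body finished in time; nothing to do
            }
            // Make the stall visible in the system log, including how far past the deadline we are.
            long overrunMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - deadlineNanos);
            LOGGER.log(Level.WARNING, "Interrupting {0}, now {1}ms past its timeout",
                    new Object[] {thread.getName(), overrunMs});
            thread.interrupt();
            // Even if something in the body swallowed that interrupt, send another signal later;
            // each ping runs essentially instantaneously and nothing waits on it.
            TIMER.schedule(this::ping, 5, TimeUnit.SECONDS);
        }

        @Override public void close() {
            done = true;
        }
    }

Usage would look like try (RepeatingTimeout t = new RepeatingTimeout(Thread.currentThread(), 10, TimeUnit.SECONDS)) { /* remote call */ }. A production version would also have to clear any interrupt that lands just after close(), so it does not leak into later code.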

@reviewbybees

@jglick requested review from abayer and svanoort December 12, 2017 23:27
@oleg-nenashev (Member) commented:

From that it was unclear whether this one call to FilePath.deleteRecursive was in fact taking more than the 10s allotted, or whether there were lots of repeated calls to run/check, or what

The object hashes were the same across 9 thread dumps taken at 5-second intervals, so most likely this is the first case.

abayer previously approved these changes Dec 13, 2017

@abayer (Member) left a comment:

Seems reasonable to me, but I'd like @svanoort's thoughts.

@jglick (Member, Author) commented Dec 13, 2017

Object hashes were same

Which object hashes? UserRequest? If so, that suggests that Channel.call is ignoring thread interruptions. I do not see that happening, but maybe I am missing something.

Anyway, @reviewbybees done for now.

@jglick (Member, Author) commented Dec 14, 2017

By the way, before implementing this (i.e., with the original Timeout behavior) I did some crude testing: I ran ExecutorStepTest.buildShellScriptAcrossRestart, which launches a JNLP agent and runs a sh step on it, modified to sleep longer so I had time to intervene; while the process was running, I STOPped the agent JVM so it would cease to respond; and after getting the expected

Cannot contact dumbo: java.lang.InterruptedException

message, I waited. The timeout was enforced to the extent that individual calls to check did in fact run for just 10s each. The build continued to wait for a response (checking repeatedly), as currently designed: Pipeline does not treat a hung or even crashed agent JVM as fatal, since the same JVM or a replacement could come back and resume status updates about the external process. (I am not sure whether @svanoort has plans to change this behavior or offer alternative durability strategies.) I then interrupted the build through the UI, and it correctly ABORTED after printing

Could not connect to dumbo to send interrupt signal to process

with no further human intervention.

I could probably develop this into an automated test in workflow-durable-task-step with a little work, though I am more interested in finding scenarios where recovery does not happen as expected. In other words, the repeated-interruption portion of this patch is a speculative fix, not something I was able to demonstrate is necessary; the enhanced logging could in the future be used to rule out possible explanations for hangs.
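
For readers following along, the polling model being described has roughly this shape, reusing the RepeatingTimeout sketch from the description above. PollingExecution and its constants are hypothetical simplifications of DurableTaskStep.Execution, not the real code:

    import java.util.concurrent.TimeUnit;

    /** Hypothetical simplification of DurableTaskStep.Execution, for illustration only. */
    class PollingExecution implements Runnable {

        private static final long RECHECK_PERIOD_SECONDS = 15; // illustrative, not the real period

        @Override public void run() {
            try (RepeatingTimeout t = new RepeatingTimeout(Thread.currentThread(), 10, TimeUnit.SECONDS)) {
                check(); // one remote round trip; hangs if the agent JVM is stopped
            } catch (Exception x) {
                // Not fatal: print something like "Cannot contact …" to the build log and keep
                // polling, since the same or a replacement agent JVM may reconnect and resume
                // status updates about the external process.
            }
            RepeatingTimeout.TIMER.schedule(this, RECHECK_PERIOD_SECONDS, TimeUnit.SECONDS);
        }

        private void check() throws Exception {
            // e.g. FilePath.act(...) over the remoting channel, as in the stack trace above
        }
    }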

svanoort previously approved these changes Dec 14, 2017

@svanoort (Member) left a comment:

I don't have a strong feeling one way or the other -- it's hard to tell what the impact will be until we see it in the wild. I'm slightly nervous about using Timer threads more, given issues we've seen with the bounded threadpool, but also trying harder to kill threads seems smart too?

The interactions with the pipeline threading model are hard to predict with precision.

I do think we want to think hard about simplifying the pipeline threading model in the mid-term future because the chain of causality has become rather esoteric.

@jglick I have no plans to modify that in durability as of yet.

@abayer (Member) commented Dec 14, 2017

Fwiw, I think this is a problem that shows up on ci.jenkins.io pretty often, so a fix is nice to see.

@jglick (Member, Author) commented Dec 14, 2017

it's hard to tell what the impact will be until we see it in the wild

Correct.

I'm slightly nervous about using Timer threads more

Not sure what you mean. Timeout was already using Timer threads to deliver interrupts. This patch could cause more tasks to be scheduled (if something else is indeed very wrong), but each one should run essentially instantaneously, and nothing waits for it.

think hard about simplifying the pipeline threading model

If you have any suggestions let me know, but note that jenkinsci/workflow-durable-task-step-plugin#21 would eliminate most of the checks discussed here (I left in a 5m ping to allow recovery from certain corner cases).

this is a problem that shows up on ci.jenkins.io pretty often

If you have the ability to track down particular instances and investigate, that would be important work. I for one have no permissions on the server.

Commit: …as a support bundle showed that all Timer threads were occupied.

@jglick dismissed stale reviews from abayer and svanoort December 14, 2017 21:12 (reason: stale)

@jglick (Member, Author) commented Dec 14, 2017

taking more than the 10s allotted

Turns out it was taking longer than 10s, but no interruption was being delivered, because the other nine Timer threads were also occupied! Oops.
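
The failure mode is easy to reproduce with a plain JDK scheduled pool of the same size as jenkins.util.Timer (ten threads). This self-contained demo is an analogy, not Jenkins code:

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    /** All ten pool threads are stuck, so a scheduled interrupt task never gets to run. */
    public class TimerStarvationDemo {
        public static void main(String[] args) throws InterruptedException {
            ScheduledExecutorService timer = Executors.newScheduledThreadPool(10);
            CountDownLatch blocked = new CountDownLatch(10);
            for (int i = 0; i < 10; i++) {
                timer.execute(() -> {
                    blocked.countDown();
                    try {
                        Thread.sleep(Long.MAX_VALUE); // stands in for a stuck remote call
                    } catch (InterruptedException x) {
                        // exit when finally interrupted by shutdownNow
                    }
                });
            }
            blocked.await();
            // The would-be timeout enforcement; it can never start, because every
            // pool thread is occupied by the stuck tasks above.
            timer.schedule(() -> System.out.println("interrupt delivered"), 1, TimeUnit.SECONDS);
            Thread.sleep(5_000);
            System.out.println("5s later: nothing printed; the timeout was never enforced");
            timer.shutdownNow();
        }
    }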

@oleg-nenashev (Member) left a comment:

🐝 nice catch

@jglick (Member, Author) commented Dec 14, 2017

@reviewbybees done FWIW

@jglick (Member, Author) commented Dec 14, 2017

BTW the agent-side hang in this case was believed to be JENKINS-39179.

@svanoort (Member) left a comment:

Ooh, I like it with the new modification. Yeah, this is some good stuff.

abayer merged commit 5ea4e13 into jenkinsci:master Dec 18, 2017
jglick deleted the Timeout branch December 18, 2017 16:06